
@jayeshmahajan

Summary

This PR adds a new example demonstrating distributed training with PyTorch's Distributed Data Parallel (DDP) on Kubernetes. The example showcases multi-node, multi-GPU training using Kubernetes Jobs with comprehensive support for major cloud providers (GKE, EKS, AKS) and on-premises deployments.

What This Example Demonstrates

  • Distributed Data Parallel (DDP) Training: Multi-node, multi-GPU PyTorch training using DDP
  • Kubernetes Jobs with Indexed Completion: Coordinated parallel training workers using completionMode: Indexed
  • Pod-to-Pod Communication: Headless Services for stable DNS-based worker discovery
  • Persistent Storage: PVCs for training data and model checkpoints
  • Workload-Aware Scheduling: Integration with Kubernetes v1.35+ workload scheduling (optional)

Key Features

1. Distributed Training Setup

  • Uses PyTorch DDP for gradient synchronization across workers
  • Automatic rank assignment from Kubernetes Job completion index
  • Master worker discovery via headless Service DNS
  • DistributedSampler for data sharding across workers
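
How these pieces fit together is easiest to see in the Job spec itself. The following is a minimal sketch of the wiring — the resource names, worker count, image, and port are illustrative assumptions, not the exact contents of training-job.yaml:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-ddp                 # hypothetical name
spec:
  completions: 4                    # one completion per worker
  parallelism: 4                    # run all four workers concurrently
  completionMode: Indexed           # gives each pod a stable completion index 0..3
  template:
    metadata:
      labels:
        app: pytorch-ddp            # matched by the headless Service selector
    spec:
      subdomain: pytorch-ddp        # must equal the headless Service name for per-pod DNS
      restartPolicy: Never
      containers:
        - name: trainer
          image: pytorch/pytorch    # placeholder image
          command: ["python", "/workspace/train.py"]
          env:
            # DDP rank comes straight from the Job completion index,
            # published by the Job controller as a pod annotation.
            - name: RANK
              valueFrom:
                fieldRef:
                  fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
            - name: WORLD_SIZE
              value: "4"
            # Pod index 0 is the DDP master, reachable via the headless Service.
            - name: MASTER_ADDR
              value: pytorch-ddp-0.pytorch-ddp
            - name: MASTER_PORT
              value: "29500"
```

With `completionMode: Indexed`, each pod's hostname is `<job-name>-<index>`, which is what makes the fixed `MASTER_ADDR` above resolvable.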

2. Kubernetes Resources

  • Job: Indexed completion mode for stable pod naming and rank assignment
  • Headless Service: Enables direct pod-to-pod communication
  • PersistentVolumeClaims: Separate volumes for training data and outputs
  • ConfigMaps: Training script and hyperparameters
  • Workload: Workload-aware scheduling support (Kubernetes v1.35+)
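
For instance, the headless Service piece can be as small as the sketch below (names and port are illustrative and must line up with the Job's pod labels and `subdomain`):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: pytorch-ddp        # referenced by `subdomain` in the Job's pod template
spec:
  clusterIP: None          # headless: DNS returns individual pod IPs, no load balancing
  selector:
    app: pytorch-ddp       # must match the labels on the Job's pods
  ports:
    - name: rendezvous
      port: 29500          # port used by torch.distributed process-group setup
```

With this in place, worker 0 is addressable as `pytorch-ddp-0.pytorch-ddp.<namespace>.svc.cluster.local`.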

3. Multi-Cloud and On-Premises Support

  • Base configuration: Generic setup that works across environments
  • Kustomize overlays: Provider-specific configurations for:
    • Google Kubernetes Engine (GKE)
    • Amazon Elastic Kubernetes Service (EKS)
    • Azure Kubernetes Service (AKS)
    • On-premises Kubernetes
  • Comprehensive comments explaining cloud-specific vs. generic configurations
  • Storage class guidance for different deployment scenarios
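
As a sketch of the overlay pattern (the layout and patch below are assumptions, not the PR's exact files), a provider overlay can stay as small as a single StorageClass patch over the shared base:

```yaml
# overlays/gke/kustomization.yaml (illustrative)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base                       # the generic, provider-neutral configuration
patches:
  - target:
      kind: PersistentVolumeClaim    # retarget every PVC in the base
    patch: |-
      - op: add
        path: /spec/storageClassName
        value: standard-rwo          # GKE balanced persistent disk (ReadWriteOnce)
```

Deploying a given environment is then `kubectl apply -k overlays/gke`, and likewise for the other provider overlays.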

4. Training Script

  • Simple CNN model for CIFAR-10 classification
  • Automatic dataset download (CIFAR-10)
  • Checkpoint saving at each epoch
  • TensorBoard logging support
  • Proper DDP initialization and cleanup
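
On the cluster side, the script and its outputs are wired in through mounts. A sketch of the relevant pod-template fragment follows — the mount paths and object names are assumptions for illustration:

```yaml
# Fragment of the Job's pod template (illustrative names)
containers:
  - name: trainer
    volumeMounts:
      - name: training-script
        mountPath: /workspace      # train.py arrives here via the ConfigMap
      - name: data
        mountPath: /data           # CIFAR-10 is downloaded and persisted here
      - name: output
        mountPath: /output         # per-epoch checkpoints and TensorBoard logs
volumes:
  - name: training-script
    configMap:
      name: training-script        # from training-script-configmap.yaml
  - name: data
    persistentVolumeClaim:
      claimName: training-data     # from data-pvc.yaml
  - name: output
    persistentVolumeClaim:
      claimName: training-output   # from output-pvc.yaml
```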

Files Included

  • training-job.yaml - Main Kubernetes Job configuration
  • train.py - PyTorch DDP training script
  • training-script-configmap.yaml - Training script as ConfigMap
  • service.yaml - Headless Service for pod communication
  • data-pvc.yaml / output-pvc.yaml - Persistent storage for the training data and for checkpoints/outputs
  • train-config.yaml - Training hyperparameters
  • workload.yaml - Workload-aware scheduling configuration
  • kustomization.yaml - Kustomize base configuration
  • README.md - Comprehensive documentation
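
For orientation, train-config.yaml might hold hyperparameters along these lines (the keys and values below are hypothetical):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: train-config
data:
  EPOCHS: "10"
  BATCH_SIZE: "128"
  LEARNING_RATE: "0.01"
```

A ConfigMap like this can be surfaced to train.py as environment variables with `envFrom` on the container.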

…premises support

This PR generalizes the PyTorch distributed training example to support multiple cloud providers (GKE, EKS, AKS) and on-premises Kubernetes deployments. The changes make cloud-specific configurations explicit through comments while maintaining backward compatibility and adding clear guidance for different deployment environments.

Key changes:
- Added on-premises Kubernetes nodeSelector examples and reorganized cloud provider configurations
- Added comprehensive comments explaining storage access modes and StorageClass options
- Updated documentation to cover all major cloud providers and on-premises deployments equally
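
The style of commented guidance reads roughly like this — the label keys are real provider node labels, but the concrete values and class names are illustrative:

```yaml
# training-job.yaml (excerpt) -- uncomment the selector matching your environment
nodeSelector:
  cloud.google.com/gke-accelerator: nvidia-tesla-t4   # GKE GPU node pools
  # eks.amazonaws.com/nodegroup: gpu-nodes            # EKS managed node group
  # kubernetes.azure.com/agentpool: gpu               # AKS node pool
  # gpu: "true"                                       # on-prem: any custom node label

# data-pvc.yaml (excerpt) -- RWX lets all workers share one dataset volume
accessModes:
  - ReadWriteMany
# storageClassName: standard-rwx    # GKE (Filestore CSI)
# storageClassName: efs-sc          # EKS (EFS CSI; class name is user-defined)
# storageClassName: azurefile-csi   # AKS (Azure Files supports RWX)
# storageClassName: nfs-client      # on-prem (e.g. nfs-subdir-external-provisioner)
```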

Benefits:
- Multi-cloud support with clear guidance for GKE, EKS, AKS, and on-premises
- Better documentation with comprehensive comments
- Easier adoption with environment-specific configuration examples
- Backward compatible - all existing configurations remain functional
@k8s-ci-robot added the do-not-merge/work-in-progress label on Jan 26, 2026
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jayeshmahajan
Once this PR has been reviewed and has the lgtm label, please assign soltysh for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the size/XXL and cncf-cla: yes labels on Jan 26, 2026